ABSTRACT

Shark is a new data analysis system that marries query process- 
ing with complex analytics on large clusters. It leverages a novel 
distributed memory abstraction to provide a unified engine that 
can run SQL queries and sophisticated analytics functions (e.g., it- 
erative machine learning) at scale, and efficiently recovers from 
failures mid-query. This allows Shark to run SQL queries up to 
100× faster than Apache Hive, and machine learning programs
up to 100× faster than Hadoop. Unlike previous systems, Shark
shows that it is possible to achieve these speedups while retain- 
ing a MapReduce-like execution engine, and the fine-grained fault 
tolerance properties that such engines provide. It extends such an 
engine in several ways, including column-oriented in-memory stor- 
age and dynamic mid-query replanning, to effectively execute SQL. 
The result is a system that matches the speedups reported for MPP 
analytic databases over MapReduce, while offering fault tolerance 
properties and complex analytics capabilities that they lack. 

1 Introduction 

Modern data analysis faces a confluence of growing challenges. 
First, data volumes are expanding dramatically, creating the need 
to scale out across clusters of hundreds of commodity machines. 
Second, this new scale increases the incidence of faults and strag- 
glers (slow tasks), complicating parallel database design. Third, the 
complexity of data analysis has also grown: modern data analysis 
employs sophisticated statistical methods, such as machine learn- 
ing algorithms, that go well beyond the roll-up and drill-down ca- 
pabilities of traditional enterprise data warehouse systems. Finally, 
despite these increases in scale and complexity, users still expect to 
be able to query data at interactive speeds. 

To tackle the "big data" problem, two major lines of systems
have recently been explored. The first, composed of MapReduce [13]
and various generalizations [17, 9], offers a fine-grained fault
tolerance model suitable for large clusters, where tasks on failed or
slow nodes can be deterministically re-executed on other nodes.
MapReduce is also fairly general: it has been shown to be able to
express many statistical and learning algorithms [11]. It also easily
supports unstructured data and "schema-on-read." However, MapReduce
engines lack many of the features that make databases efficient, and
have high latencies of tens of seconds to hours. Even systems that
have significantly optimized MapReduce for SQL queries, such as
Google's Tenzing [9], or that combine it with a traditional database
on each node, such as HadoopDB [3], report a minimum latency
of 10 seconds. As such, MapReduce approaches have largely been
dismissed for interactive-speed queries [25], and even Google is
developing new engines for such workloads [24].

Instead, most MPP analytic databases (e.g., Vertica, Greenplum,
Teradata) and several of the new low-latency engines proposed for
MapReduce environments (e.g., Google Dremel [24], Cloudera Impala [1])
employ a coarser-grained recovery model, where an entire
query has to be resubmitted if a machine fails.^1 This works well
for short queries where a retry is inexpensive, but faces significant
challenges in long queries as clusters scale up [3]. In addition,
these systems often lack the rich analytics functions that are easy 
to implement in MapReduce, such as machine learning and graph 
algorithms. Furthermore, while it may be possible to implement 
some of these functions using UDFs, these algorithms are often 
expensive, furthering the need for fault and straggler recovery for 
long queries. Thus, most organizations tend to use other systems 
alongside MPP databases to perform complex analytics. 

To provide an effective environment for big data analysis, we 
believe that processing systems will need to support both SQL and 
complex analytics efficiently, and to provide fine-grained fault re- 
covery across both types of operations. This paper describes a new 
system that meets these goals, called Shark. Shark is open source 
and compatible with Apache Hive, and has already been used at 
web companies to speed up queries by 40-100×.

Shark builds on a recently-proposed distributed shared memory 
abstraction called Resilient Distributed Datasets (RDDs) [33] to
perform most computations in memory while offering fine-grained 
fault tolerance. In-memory computing is increasingly important in 
large-scale analytics for two reasons. First, many complex analyt- 
ics functions, such as machine learning and graph algorithms, are 
iterative, going over the data multiple times; thus, the fastest
systems deployed for these applications are in-memory [23, 22].
Second, even traditional SQL warehouse workloads exhibit strong
temporal and spatial locality, because more-recent fact table data
and small dimension tables are read disproportionately often. A 
study of Facebook's Hive warehouse and Microsoft's Bing analyt- 
ics cluster showed that over 95% of queries in both systems could 
be served out of memory using just 64 GB/node as a cache, even 
though each system manages more than 100 PB of total data [5].

The main benefit of RDDs is an efficient mechanism for fault 
recovery. Traditional main-memory databases support fine-grained 
updates to tables and replicate writes across the network for fault 
tolerance, which is expensive on large commodity clusters. In con- 
trast, RDDs restrict the programming interface to coarse-grained 
deterministic operators that affect multiple data items at once, such 
as map, group-by and join, and recover from failures by tracking the 
lineage of each dataset and recomputing lost data. This approach 
works well for data-parallel relational queries, and has also been 
shown to support machine learning and graph computation [33].
Thus, when a node fails, Shark can recover mid-query by rerunning
the deterministic operations used to build lost data partitions
on other nodes, similar to MapReduce. Indeed, it typically recovers
within seconds, by parallelizing this work across the cluster.

^1 Dremel provides fault tolerance within a query, but Dremel is
limited to aggregation trees instead of the more complex
communication patterns in joins.

Figure 1: Performance of Shark vs. Hive/Hadoop on two SQL
queries from an early user and one iteration of logistic regression
(a classification algorithm that runs ~10 such steps). Results
measure the runtime (seconds) on a 100-node cluster.

To run SQL efficiently, however, we also had to extend the RDD 
execution model, bringing in several concepts from traditional
analytical databases and some new ones. We started with an existing
implementation of RDDs called Spark [33], and added several
features. First, to store and process relational data efficiently, we
implemented in-memory columnar storage and columnar compression.
This reduced both the data size and the processing time by
as much as 5× over naively storing the data in a Spark program
in its original format. Second, to optimize SQL queries based on 
the data characteristics even in the presence of analytics functions 
and UDFs, we extended Spark with Partial DAG Execution (PDE): 
Shark can reoptimize a running query after running the first few 
stages of its task DAG, choosing better join strategies or the right 
degree of parallelism based on observed statistics. Third, we lever- 
age other properties of the Spark engine not present in traditional 
MapReduce systems, such as control over data partitioning. 

Our implementation of Shark is compatible with Apache Hive
[28], supporting all of Hive's SQL dialect and UDFs and allowing
execution over unmodified Hive data warehouses. It augments SQL 
with complex analytics functions written in Spark, using Spark's 
Java, Scala or Python APIs. These functions can be combined with 
SQL in a single execution plan, providing in-memory data sharing 
and fast recovery across both types of processing. 

Experiments show that using RDDs and the optimizations above,
Shark can answer SQL queries up to 100× faster than Hive, run
iterative machine learning algorithms up to 100× faster than Hadoop,
and recover from failures mid-query within seconds. Figure 1
shows three sample results. Shark's speed is comparable to that of
MPP databases in benchmarks like Pavlo et al.'s comparison with
MapReduce [25], but it offers fine-grained recovery and complex
analytics features that these systems lack.

More fundamentally, our work shows that MapReduce-like execution
models can be applied effectively to SQL, and offer a promising
way to combine relational and complex analytics. In the course
of presenting Shark, we also explore why SQL engines over previous
MapReduce runtimes, such as Hive, are slow, and show how
a combination of enhancements in Shark (e.g., PDE), and engine 
properties that have not been optimized in MapReduce, such as the 
overhead of launching tasks, eliminate many of the bottlenecks in 
traditional MapReduce systems. 

2 System Overview 

Shark is a data analysis system that supports both SQL query
processing and machine learning functions. We have chosen to
implement Shark to be compatible with Apache Hive. It can be used to
query an existing Hive warehouse and return results much faster,
without modification to either the data or the queries.

Figure 2: Shark Architecture

Thanks to its Hive compatibility, Shark can query data in any
system that supports the Hadoop storage API, including HDFS and
Amazon S3. It also supports a wide range of data formats such
as text, binary sequence files, JSON, and XML. It inherits Hive's
schema-on-read capability and nested data types [28].

In addition, users can choose to load high-value data into Shark's
memory store for fast analytics, as shown below: 

CREATE TABLE latest_logs
TBLPROPERTIES ("shark.cache"=true)
AS SELECT * FROM logs WHERE date > now()-3600;

Figure 2 shows the architecture of a Shark cluster, consisting of
a single master node and a number of slave nodes, with the warehouse
metadata stored in an external transactional database. It is
built on top of Spark, a modern MapReduce-like cluster computing
engine. When a query is submitted to the master, Shark compiles
the query into an operator tree represented as RDDs, as we shall
discuss in Section 2.4. These RDDs are then translated by Spark into
a graph of tasks to execute on the slave nodes.

Cluster resources can optionally be allocated by a cluster re- 
source manager (e.g., Hadoop YARN or Apache Mesos) that pro- 
vides resource sharing and isolation between different computing 
frameworks, allowing Shark to coexist with engines like Hadoop. 

In the remainder of this section, we cover the basics of Spark and 
the RDD programming model, followed by an explanation of how 
Shark query plans are generated and run. 

2.1 Spark 

Spark is the MapReduce-like cluster computing engine used by
Shark. Spark has several features that differentiate it from
traditional MapReduce engines [33]:

1. Like Dryad and Tenzing [17, 9], it supports general computation
DAGs, not just the two-stage MapReduce topology.

2. It provides an in-memory storage abstraction called Resilient 
Distributed Datasets (RDDs) that lets applications keep data 
in memory across queries, and automatically reconstructs it 
after failures [33].

3. The engine is optimized for low latency. It can efficiently 
manage tasks as short as 100 milliseconds on clusters of 
thousands of cores, while engines like Hadoop incur a la- 
tency of 5-10 seconds to launch each task. 

Figure 3: Lineage graph for the RDDs in our Spark example.
Oblongs represent RDDs, while circles show partitions within 
a dataset. Lineage is tracked at the granularity of partitions. 

RDDs are unique to Spark, and were essential to enabling mid- 
query fault tolerance. However, the other differences are important 
engineering elements that contribute to Shark's performance. 

On top of these features, we have also modified the Spark engine 
for Shark to support partial DAG execution, that is, modification 
of the query plan DAG after only some of the stages have finished, 
based on statistics collected from these stages. Similar to [20], we
use this technique to optimize join algorithms and other aspects of
the execution mid-query, as we shall discuss in Section 3.1.

2.2 Resilient Distributed Datasets (RDDs) 

Spark's main abstraction is resilient distributed datasets (RDDs), 
which are immutable, partitioned collections that can be created 
through various data-parallel operators (e.g., map, group-by,
hash-join). Each RDD is either a collection stored in an external storage
system, such as a file in HDFS, or a derived dataset created by 
applying operators to other RDDs. For example, given an RDD of 
(visitID, URL) pairs for visits to a website, we might compute an 
RDD of (URL, count) pairs by applying a map operator to turn each 
event into a (URL, 1) pair, and then a reduce to add the counts by
URL. 

In Spark's native API, RDD operations are invoked through a 
functional interface similar to DryadLINQ [19] in Scala, Java or
Python. For example, the Scala code for the query above is: 

val visits = spark.hadoopFile("hdfs://...")
val counts = visits.map(v => (v.url, 1))
                   .reduceByKey((a, b) => a + b)

RDDs can contain arbitrary data types as elements (since Spark 
runs on the JVM, these elements are Java objects), and are au- 
tomatically partitioned across the cluster, but they are immutable 
once created, and they can only be created through Spark's deter- 
ministic parallel operators. These two restrictions, however, enable 
highly efficient fault recovery. In particular, instead of replicating 
each RDD across nodes for fault tolerance, Spark remembers the
lineage of the RDD (the graph of operators used to build it), and
recovers lost partitions by recomputing them from base data [33].^2
For example, Figure 3 shows the lineage graph for the RDDs
computed above. If Spark loses one of the partitions in the (URL, 1)
RDD, for example, it can recompute it by rerunning the map on 
just the corresponding partition of the input file. 
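
The recovery path described above can be illustrated with a small single-process sketch in plain Python. It is not Spark's implementation: the `LineageRDD` class and its methods are hypothetical names introduced here, and a "node failure" is simulated by dropping a cached partition. The sketch only shows that a lost partition can be rebuilt by rerunning its operator on the matching input partition.

```python
from typing import Callable, Dict, List


class LineageRDD:
    """Toy partitioned dataset that remembers how each partition is derived."""

    def __init__(self, num_partitions: int, compute: Callable[[int], List]):
        self.num_partitions = num_partitions
        self._compute = compute            # lineage: how to rebuild partition i
        self._cache: Dict[int, List] = {}  # in-memory copies (may be lost)

    @staticmethod
    def from_source(partitions: List[List]) -> "LineageRDD":
        # Base data is assumed not to change after the RDD is created.
        return LineageRDD(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn: Callable) -> "LineageRDD":
        # Partition i of the result depends only on partition i of the parent.
        return LineageRDD(self.num_partitions,
                          lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i: int) -> List:
        if i not in self._cache:           # missing or lost: recompute it
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        self._cache.pop(i, None)           # simulate a node failure


visits = LineageRDD.from_source([["/a", "/b"], ["/a"], ["/c", "/a"]])
pairs = visits.map(lambda url: (url, 1))

before = [pairs.partition(i) for i in range(pairs.num_partitions)]
pairs.lose_partition(1)                    # only partition 1 is lost
recovered = pairs.partition(1)             # reruns map on input partition 1
```

Because the operators are deterministic and the inputs immutable, the recomputed partition is identical to the lost one, and the other partitions are never touched.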

The RDD model offers several key benefits in our large-scale
in-memory computing setting. First, RDDs can be written at the speed
of DRAM instead of the speed of the network, because there is no
need to replicate each byte written to another machine for fault
tolerance. DRAM in a modern server is over 10× faster than even a
10-Gigabit network. Second, Spark can keep just one copy of each
RDD partition in memory, saving precious memory over a replicated
system, since it can always recover lost data using lineage.
Third, when a node fails, its lost RDD partitions can be rebuilt in
parallel across the other nodes, allowing speedy recovery.^3 Fourth,
even if a node is just slow (a "straggler"), we can recompute
necessary partitions on other nodes because RDDs are immutable, so
there are no consistency concerns with having two copies of a
partition. These benefits make RDDs attractive as the foundation for
our relational processing in Shark.

^2 We assume that external files for RDDs representing external data
do not change, or that we can take a snapshot of a file when we
create an RDD from it.

2.3 Fault Tolerance Guarantees 

To summarize the benefits of RDDs explained above, Shark
provides the following fault tolerance properties, which have been
difficult to support in traditional MPP database designs:

1. Shark can tolerate the loss of any set of worker nodes. The 
execution engine will re-execute any lost tasks and recompute
any lost RDD partitions using lineage.^4 This is true
even within a query: Spark will rerun any failed tasks, or 
lost dependencies of new tasks, without aborting the query. 

2. Recovery is parallelized across the cluster. If a failed node 
contained 100 RDD partitions, these can be rebuilt in parallel 
on 100 different nodes, quickly recovering the lost data. 

3. The deterministic nature of RDDs also enables straggler
mitigation: if a task is slow, the system can launch a speculative
"backup copy" of it on another node, as in MapReduce [13].

4. Recovery works even in queries that combine SQL and machine
learning UDFs (Section 4), as these operations all compile
into a single RDD lineage graph.

2.4 Executing SQL over RDDs 

Shark runs SQL queries over Spark using a three-step process sim- 
ilar to traditional RDBMSs: query parsing, logical plan generation, 
and physical plan generation. 

Given a query, Shark uses the Hive query compiler to parse the
query and generate an abstract syntax tree. The tree is then turned
into a logical plan, and basic logical optimization, such as
predicate pushdown, is applied. Up to this point, Shark and Hive share
an identical approach. Hive would then convert the logical plan into a
physical plan consisting of multiple MapReduce stages. In the case
of Shark, its optimizer applies additional rule-based optimizations,
such as pushing LIMIT down to individual partitions, and creates
a physical plan consisting of transformations on RDDs rather than
MapReduce jobs. We use a variety of operators already present in
Spark, such as map and reduce, as well as new operators we
implemented for Shark, such as broadcast joins. Spark's master then
executes this graph using standard MapReduce scheduling techniques,
such as placing tasks close to their input data, rerunning lost tasks,
and performing straggler mitigation [33].
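
As an illustration of one rule mentioned above, pushing LIMIT down to individual partitions, the following plain-Python sketch truncates each partition before merging. The function name `limit_pushdown` and the list-of-partitions representation are assumptions made for this example; they are not Shark's optimizer code.

```python
from itertools import islice
from typing import Iterable, List


def limit_pushdown(partitions: List[Iterable], n: int) -> List:
    """Apply LIMIT n per partition first, then once more on the merged rows."""
    # Step 1 (would run on each worker): keep at most n rows per partition.
    partial = [list(islice(part, n)) for part in partitions]
    # Step 2 (would run on the master): merge and apply the final LIMIT.
    merged = (row for part in partial for row in part)
    return list(islice(merged, n))


partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
rows = limit_pushdown(partitions, 5)
```

With the per-partition limit in place, each task forwards at most n rows instead of its whole partition, so the amount of data sent to the final step is bounded by n times the number of partitions.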

While this basic approach makes it possible to run SQL over
Spark, doing so efficiently is challenging. The prevalence of UDFs
and complex analytic functions in Shark's workload makes it
difficult to determine an optimal query plan at compile time, especially
for new data that has not undergone ETL. In addition, even with
such a plan, naively executing it over Spark (or other MapReduce
runtimes) can be inefficient. In the next section, we discuss
several extensions we made to Spark to efficiently store relational data
and run SQL, starting with a mechanism that allows for dynamic,
statistics-driven re-optimization at run-time.

^3 To provide fault tolerance across "shuffle" operations like a
parallel reduce, the execution engine also saves the "map" side of the
shuffle in memory on the source nodes, spilling to disk if necessary.

^4 Support for master recovery could also be added by reliably
logging the RDD lineage graph and the submitted jobs, because this
state is small, but we have not yet implemented this.

3 Engine Extensions 

In this section, we describe our modifications to the Spark engine 
to enable efficient execution of SQL queries. 

3.1 Partial DAG Execution (PDE) 

Systems like Shark and Hive are frequently used to query fresh data 
that has not undergone a data loading process. This precludes the 
use of static query optimization techniques that rely on accurate a 
priori data statistics, such as statistics maintained by indices. The 
lack of statistics for fresh data, combined with the prevalent use of 
UDFs, necessitates dynamic approaches to query optimization. 

To support dynamic query optimization in a distributed setting, 
we extended Spark to support partial DAG execution (PDE), a tech- 
nique that allows dynamic alteration of query plans based on data 
statistics collected at run-time. 

We currently apply partial DAG execution at blocking "shuf- 
fle" operator boundaries where data is exchanged and repartitioned, 
since these are typically the most expensive operations in Shark. By
default, Spark materializes the output of each map task in memory
before a shuffle, spilling it to disk as necessary. Later, reduce tasks 
fetch this output. 

PDE modifies this mechanism in two ways. First, it gathers cus- 
tomizable statistics at global and per-partition granularities while 
materializing map output. Second, it allows the DAG to be altered 
based on these statistics, either by choosing different operators or 
altering their parameters (such as their degrees of parallelism). 

These statistics are customizable using a simple, pluggable ac- 
cumulator API. Some example statistics include: 

1. Partition sizes and record counts, which can be used to detect
skew. 

2. Lists of "heavy hitters," i.e., items that occur frequently in 
the dataset. 

3. Approximate histograms, which can be used to estimate the
distribution of the data within each partition.
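
The kinds of statistics listed above could be gathered with an accumulator along the following lines. This is a hedged sketch in plain Python: `MapOutputStats` and its methods are hypothetical names, not Shark's accumulator API, and it keeps exact key counts, whereas a bounded, approximate summary would be needed to stay within a small per-task budget.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MapOutputStats:
    """Hypothetical per-task statistics gathered while writing map output."""
    records: int = 0
    size_bytes: int = 0
    key_counts: Counter = field(default_factory=Counter)

    def add(self, key: str, value: str) -> None:
        # Called once per output record inside a map task.
        self.records += 1
        self.size_bytes += len(key) + len(value)
        self.key_counts[key] += 1

    def merge(self, other: "MapOutputStats") -> "MapOutputStats":
        # Called on the master to combine statistics from different tasks.
        return MapOutputStats(self.records + other.records,
                              self.size_bytes + other.size_bytes,
                              self.key_counts + other.key_counts)

    def heavy_hitters(self, k: int) -> List[Tuple[str, int]]:
        return self.key_counts.most_common(k)


task1, task2 = MapOutputStats(), MapOutputStats()
for key, value in [("a", "x"), ("a", "y"), ("b", "z")]:
    task1.add(key, value)
for key, value in [("a", "w"), ("c", "v")]:
    task2.add(key, value)
total = task1.merge(task2)
```

The `merge` method is what makes the statistic pluggable in this sketch: any summary that can be updated per record and combined across tasks fits the same shape.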

These statistics are sent by each worker to the master, where they 
are aggregated and presented to the optimizer. For efficiency, we 
use lossy compression to record the statistics, limiting their size to 
1-2 KB per task. For instance, we encode partition sizes (in bytes) 
with logarithmic encoding, which can represent sizes of up to 32 
GB using only one byte with at most 10% error. The master can 
then use these statistics to perform various run-time optimizations, 
as we shall discuss next. 
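
The one-byte logarithmic encoding of partition sizes can be sketched as follows. The base of 1.1 is an assumption chosen to be consistent with the numbers quoted above: 1.1 raised to the power 255 is roughly 3.6e10 bytes, just above 32 GB, and rounding the exponent up means the decoded value overestimates the true size by at most about 10%. The function names are illustrative.

```python
import math

LOG_BASE = 1.1  # assumed base; 1.1 ** 255 is roughly 3.6e10 bytes (> 32 GB)


def compress_size(size: int) -> int:
    """Encode a partition size in bytes as a single byte value (0..255)."""
    if size <= 0:
        return 0                      # code 0 is reserved for empty partitions
    code = math.ceil(math.log(size) / math.log(LOG_BASE))
    return min(255, max(1, code))     # clamp into one unsigned byte


def decompress_size(code: int) -> float:
    """Recover an approximate size; overestimates by at most about 10%."""
    return 0.0 if code == 0 else LOG_BASE ** code


for size in (1, 4096, 10**6, 32 * 2**30):
    code = compress_size(size)
    approx = decompress_size(code)
    print(size, code, round(approx / size, 3))
```

Sizes beyond the representable range are clamped to the largest code in this sketch, so they would be underestimated; a lossy scheme of this kind trades exactness for a fixed, tiny footprint per partition.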

Partial DAG execution complements existing adaptive query
optimization techniques that typically run in a single-node system
[6, 20, 30], as we can use existing techniques to dynamically optimize
the local plan within each node, and use PDE to optimize the global
structure of the plan at stage boundaries. This fine-grained
statistics collection, and the optimizations that it enables,
differentiate PDE from graph rewriting features in previous systems,
such as DryadLINQ [19].

3.1.1 Join Optimization 

Partial DAG execution can be used to perform several run-time op- 
timizations for join queries. 

Figure 4 illustrates two communication patterns for MapReduce-style
joins. In shuffle join, both join tables are hash-partitioned by
the join key. Each reducer joins corresponding partitions using a
local join algorithm, which is chosen by each reducer based on
run-time statistics. If one of a reducer's input partitions is small, then it
constructs a hash table over the small partition and probes it using
the large partition. If both partitions are large, then a symmetric
hash join is performed by constructing hash tables over both inputs.

Figure 4: Data flows for map join and shuffle join. Map join
broadcasts the small table to all large table partitions, while
shuffle join repartitions and shuffles both tables.

In map join, also known as broadcast join, a small input table is 
broadcast to all nodes, where it is joined with each partition of a 
large table. This approach can result in significant cost savings by 
avoiding an expensive repartitioning and shuffling phase. 

Map join is only worthwhile if some join inputs are small, so 
Shark uses partial DAG execution to select the join strategy at run- 
time based on its inputs' exact sizes. By using sizes of the join 
inputs gathered at run-time, this approach works well even with in- 
put tables that have no prior statistics, such as intermediate results. 
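
The run-time choice between the two join strategies can be sketched in plain Python as below. All names here are hypothetical, the row-count threshold `BROADCAST_THRESHOLD` is an arbitrary value picked for the example, and lists stand in for distributed tables; the point is only that the decision is taken after the size of an input has been observed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]                      # (join key, payload)
BROADCAST_THRESHOLD = 3                    # hypothetical cutoff, in rows


def hash_join(build: List[Row], probe: List[Row]) -> List[Tuple[int, str, str]]:
    """Local join: build a hash table on one input, probe with the other."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    return [(k, b, p) for k, p in probe for b in table.get(k, [])]


def map_join(small: List[Row], large_parts: List[List[Row]]):
    # The small table is broadcast; each large partition joins locally.
    return [row for part in large_parts for row in hash_join(small, part)]


def shuffle_join(left: List[Row], right_parts: List[List[Row]], n: int):
    # Both inputs are repartitioned by hash of the join key, then joined.
    lparts = [[r for r in left if hash(r[0]) % n == i] for i in range(n)]
    rparts = [[r for part in right_parts for r in part
               if hash(r[0]) % n == i] for i in range(n)]
    return [row for l, r in zip(lparts, rparts) for row in hash_join(l, r)]


def run_join(left: List[Row], right_parts: List[List[Row]], n: int = 2):
    # The size of `left` is observed at run time, after it has been computed.
    if len(left) <= BROADCAST_THRESHOLD:
        return "map join", map_join(left, right_parts)
    return "shuffle join", shuffle_join(left, right_parts, n)


dims = [(1, "US"), (2, "DE")]
facts = [[(1, "click"), (2, "view")], [(1, "view"), (3, "click")]]
strategy, rows = run_join(dims, facts)
```

Both strategies return the same rows; they differ only in how much data has to move before the local joins can run.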

Run-time statistics also inform the join tasks' scheduling poli- 
cies. If the optimizer has a prior belief that a particular join input 
will be small, it will schedule that task before other join inputs and 
decide to perform a map-join if it observes that the task's output is 
small. This allows the query engine to avoid performing the pre- 
shuffle partitioning of a large table once the optimizer has decided 
to perform a map-join. 

3.1.2 Skew-handling and Degree of Parallelism 

Partial DAG execution can also be used to determine operators' 
degrees of parallelism and to mitigate skew. 

The degree of parallelism for reduce tasks can have a large per- 
formance impact: launching too few reducers may overload re- 
ducers' network connections and exhaust their memories, while 
launching too many may prolong the job due to task scheduling 
overhead. Hive's performance is especially sensitive to the number 
of reduce tasks, due to Hadoop's large scheduling overhead. 

Using partial DAG execution, Shark can use individual partitions'
sizes to determine the number of reducers at run-time by
coalescing many small, fine-grained partitions into fewer coarse
partitions that are used by reduce tasks. To mitigate skew, fine-grained
partitions are assigned to coalesced partitions using a greedy
bin-packing heuristic that attempts to equalize coalesced partitions'
sizes [15]. This offers performance benefits, especially when good
bin-packings exist.
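
One common greedy bin-packing rule, shown here as an assumption rather than as the specific heuristic cited above, is to consider partitions from largest to smallest and place each in the currently least-loaded bin. The sketch below applies that rule to a list of observed partition sizes; the function name `coalesce` is illustrative.

```python
import heapq
from typing import List, Tuple


def coalesce(sizes: List[int], num_reducers: int) -> List[List[int]]:
    """Assign fine-grained partitions (by index) to `num_reducers` bins.

    Greedy rule: take partitions from largest to smallest and put each one
    into the bin that currently holds the fewest bytes.
    """
    heap: List[Tuple[int, int]] = [(0, b) for b in range(num_reducers)]
    bins: List[List[int]] = [[] for _ in range(num_reducers)]
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for i in order:
        load, b = heapq.heappop(heap)      # least-loaded bin so far
        bins[b].append(i)
        heapq.heappush(heap, (load + sizes[i], b))
    return bins


sizes = [90, 10, 40, 40, 10, 10]           # observed map-output sizes (MB)
bins = coalesce(sizes, num_reducers=2)
loads = [sum(sizes[i] for i in b) for b in bins]
```

In this example the one large partition ends up sharing a bin with a single small one, while the remaining partitions fill the other bin, so both reducers receive the same amount of data.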

Somewhat surprisingly, we discovered that Shark can obtain sim- 
ilar performance improvement by running a larger number of re- 
duce tasks. We attribute this to Spark's low scheduling overhead. 



3.2 Columnar Memory Store 

In-memory computation is essential to low-latency query answer- 
ing, given that memory's throughput is orders of magnitude higher 
than that of disks. Naively using Spark's memory store, however, 
can lead to undesirable performance. Shark implements a columnar 
memory store on top of Spark's memory store. 

In-memory data representation affects both space footprint and 
read throughput. A naive approach is to simply cache the on-disk 
data in its native format, performing on-demand deserialization in 
the query processor. This deserialization becomes a major bottle- 
neck: in our studies, we saw that modern commodity CPUs can 
deserialize at a rate of only 200 MB per second per core.

The approach taken by Spark's default memory store is to store 
data partitions as collections of JVM objects. This avoids deserial- 
ization, since the query processor can directly use these objects, but 
leads to significant storage space overheads. Common JVM imple- 
mentations add 12 to 16 bytes of overhead per object. For example, 
storing 270 MB of the TPC-H lineitem table as JVM objects uses
approximately 971 MB of memory, while a serialized representation
requires only 289 MB, nearly three times less space. A more seri- 
ous implication, however, is the effect on garbage collection (GC). 
With a 200 B record size, a 32 GB heap can contain 160 million ob- 
jects. The JVM garbage collection time correlates linearly with the 
number of objects in the heap, so it could take minutes to perform 
a full GC on a large heap. These unpredictable, expensive garbage 
collections cause large variability in workers' response times. 

Shark stores all columns of primitive types as JVM primitive 
arrays. Complex data types supported by Hive, such as map and 
array, are serialized and concatenated into a single byte array. 
Each column creates only one JVM object, leading to fast GCs and 
a compact data representation. The space footprint of columnar 
data can be further reduced by cheap compression techniques at 
virtually no CPU cost. Similar to more traditional database systems 
[27], Shark implements CPU-efficient compression schemes such
as dictionary encoding, run-length encoding, and bit packing. 
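
Two of the schemes named above, dictionary encoding and run-length encoding, can be illustrated with a short plain-Python sketch. It operates on Python lists rather than JVM primitive arrays, and the function names are chosen for this example only.

```python
from typing import Dict, List, Tuple


def rle_encode(values: List) -> List[Tuple[object, int]]:
    """Run-length encoding: collapse runs of equal values into (value, count)."""
    runs: List[Tuple[object, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[object, int]]) -> List:
    return [v for v, n in runs for _ in range(n)]


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary encoding: store each distinct value once, plus small codes."""
    codes: Dict[str, int] = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), encoded


country = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dict_encode(country)   # strings become small integers
runs = rle_encode(codes)                   # repeated codes become runs
restored = [dictionary[c] for c in rle_decode(runs)]
```

The two schemes compose naturally: dictionary codes for a clustered column tend to form long runs, which run-length encoding then shortens further.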

Columnar data representation also leads to better cache behavior,
especially for analytical queries that frequently compute
aggregations on certain columns.

3.3 Distributed Data Loading 

In addition to query execution, Shark also uses Spark's execution
engine for distributed data loading. During loading, a table is split
into small partitions, each of which is loaded by a Spark task. Each
loading task uses the data schema to extract individual fields from
rows, marshals a partition of data into its columnar representation,
and stores those columns in memory.

Each data loading task tracks metadata to decide whether each 
column in a partition should be compressed. For example, the 
loading task will compress a column using dictionary encoding 
if its number of distinct values is below a threshold. This allows 
each task to choose the best compression scheme for each partition, 
rather than conforming to a global compression scheme that might 
not be optimal for local partitions. These local decisions do not 
require coordination among data loading tasks, allowing the load 
phase to achieve a maximum degree of parallelism, at the small cost 
of requiring each partition to maintain its own compression meta- 
data. It is important to clarify that an RDD's lineage does not need 
to contain the compression scheme and metadata for each parti- 
tion. The compression scheme and metadata are simply byproducts 
of the RDD computation, and can be deterministically recomputed 
along with the in-memory data in the case of failures. 
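
The per-partition decision described above can be sketched as follows. The threshold value and the function name `load_column_partition` are hypothetical; the sketch shows only that each task chooses an encoding from its own data, and that rerunning the task reproduces the same choice and metadata.

```python
from typing import List, Tuple

DICT_THRESHOLD = 4   # hypothetical cutoff on distinct values per partition


def load_column_partition(values: List[str]) -> Tuple[str, object]:
    """Pick an encoding for one column of one partition, using local data only."""
    distinct = sorted(set(values))
    if len(distinct) < DICT_THRESHOLD:
        index = {v: i for i, v in enumerate(distinct)}
        return "dictionary", (distinct, [index[v] for v in values])
    return "plain", list(values)


# Each partition decides independently; no coordination between tasks.
partitions = [["US", "US", "DE"], ["a", "b", "c", "d", "e"]]
loaded = [load_column_partition(p) for p in partitions]
```

Because the function is deterministic in its input, the compression metadata does not have to be stored in the lineage: recomputing a lost partition regenerates it.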

As a result, Shark can load data into memory at the aggregated
throughput of the CPUs processing incoming data. 



Pavlo et al. [25] showed that Hadoop was able to perform data
loading at 5 to 10 times the throughput of MPP databases. Tested
using the same dataset used in [25], Shark provides the same
throughput as Hadoop in loading data into HDFS. Shark is 5 times faster
than Hadoop when loading data into its memory store.

3.4 Data Co-partitioning 

In some warehouse workloads, two tables are frequently joined to- 
gether. For example, the TPC-H benchmark frequently joins the 
lineitem and order tables. A technique commonly used by MPP 
databases is to co-partition the two tables based on their join key in 
the data loading process. In distributed file systems like HDFS, 
the storage system is schema-agnostic, which prevents data
co-partitioning. Shark allows co-partitioning two tables on a common
key for faster joins in subsequent queries. This can be
accomplished with the DISTRIBUTE BY clause:

CREATE TABLE l_mem TBLPROPERTIES ("shark.cache"=true)
AS SELECT * FROM lineitem DISTRIBUTE BY L_ORDERKEY;

CREATE TABLE o_mem TBLPROPERTIES (
  "shark.cache"=true, "copartition"="l_mem")
AS SELECT * FROM order DISTRIBUTE BY O_ORDERKEY;

When joining two co-partitioned tables, Shark's optimizer
constructs a DAG that avoids the expensive shuffle and instead uses
map tasks to perform the join. 
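
Why co-partitioning removes the shuffle can be seen in a small plain-Python sketch. It assumes both tables are partitioned with the same hash function and the same number of partitions; `distribute_by`, `copartitioned_join` and the sample rows are illustrative names and data, not Shark code.

```python
from typing import Dict, List, Tuple

NUM_PARTITIONS = 4


def distribute_by(rows: List[Tuple[int, str]]) -> List[List[Tuple[int, str]]]:
    """Place each row in the partition chosen by hashing its join key."""
    parts: List[List[Tuple[int, str]]] = [[] for _ in range(NUM_PARTITIONS)]
    for key, payload in rows:
        parts[hash(key) % NUM_PARTITIONS].append((key, payload))
    return parts


def copartitioned_join(lparts, rparts):
    """Join partition i with partition i only; no rows cross partitions."""
    out = []
    for lpart, rpart in zip(lparts, rparts):
        table: Dict[int, List[str]] = {}
        for key, payload in rpart:
            table.setdefault(key, []).append(payload)
        out.extend((k, lp, rp) for k, lp in lpart for rp in table.get(k, []))
    return out


lineitem = [(1, "item-a"), (1, "item-b"), (6, "item-c"), (7, "item-d")]
orders = [(1, "order-1"), (6, "order-6"), (9, "order-9")]
joined = copartitioned_join(distribute_by(lineitem), distribute_by(orders))
```

Since matching keys always hash to the same partition index in both tables, each partition pair can be joined by an independent map task.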

3.5 Partition Statistics and Map Pruning 

Data tend to be stored in some logical clustering on one or more 
columns. For example, entries in a website's traffic log data might 
be grouped by users' physical locations, because logs are first stored 
in data centers that have the best geographical proximity to users. 
Within each data center, logs are append-only and are stored in 
roughly chronological order. As a less obvious case, a news site's 
logs might contain news_id and timestamp columns that have
strongly correlated values. For analytical queries, it is typical to 
apply filter predicates or aggregations over such columns. For ex- 
ample, a daily warehouse report might describe how different visi- 
tor segments interact with the website; this type of query naturally 
applies a predicate on timestamps and performs aggregations that 
are grouped by geographical location. This pattern is even more 
frequent for interactive data analysis, during which drill-down op- 
erations are frequently performed. 

Map pruning is the process of pruning data partitions based on 
their natural clustering columns. Since Shark's memory store splits 
data into small partitions, each block contains only one or a few
logical groups on such columns, and Shark can avoid scanning certain
blocks of data if their values fall outside the query's filter range.

To take advantage of these natural clusterings of columns, Shark's
memory store on each worker piggybacks on the data loading process
to collect statistics. The information collected for each partition
includes the range of each column and the distinct values if the
number of distinct values is small (i.e., enum columns). The collected
statistics are sent back to the master program and kept in memory 
for pruning partitions during query execution. 

When a query is issued, Shark evaluates the query's predicates
against all partition statistics; partitions that do not satisfy the
predicate are pruned and Shark does not launch tasks to scan them.
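
A minimal sketch of this pruning test follows, assuming a single integer column and two predicate shapes. The names ColumnStats, MaxDistinct, mayMatch, and partitionsToScan are hypothetical and do not come from Shark; the sketch only illustrates how per-partition ranges and small distinct-value sets can rule out partitions before any task is launched.

```scala
object MapPruning {
  // Per-partition statistics for one integer column: its value range, plus the
  // set of distinct values when that set is small (an "enum"-like column).
  final case class ColumnStats(min: Long, max: Long, distinct: Option[Set[Long]])

  // Hypothetical cap on how many distinct values are tracked exactly.
  val MaxDistinct = 16

  // Collected while a partition is loaded (the "piggybacking" step).
  def collect(values: Seq[Long]): ColumnStats = {
    require(values.nonEmpty, "partition must not be empty")
    val d = values.toSet
    ColumnStats(values.min, values.max, if (d.size <= MaxDistinct) Some(d) else None)
  }

  // Two predicate shapes are enough for the sketch.
  sealed trait Predicate
  final case class Between(lo: Long, hi: Long) extends Predicate
  final case class Equals(v: Long) extends Predicate

  // True if some row in the partition might satisfy the predicate.
  // Returning true is always safe; returning false lets the scan be skipped.
  def mayMatch(s: ColumnStats, p: Predicate): Boolean = p match {
    case Between(lo, hi) => s.max >= lo && s.min <= hi
    case Equals(v) =>
      s.distinct.map(_.contains(v)).getOrElse(v >= s.min && v <= s.max)
  }

  // Indices of the partitions that still need a scan task.
  def partitionsToScan(stats: Seq[ColumnStats], p: Predicate): Seq[Int] =
    stats.zipWithIndex.collect { case (s, i) if mayMatch(s, p) => i }
}
```

The check is conservative by design: statistics can prove that a partition holds no matching rows, but never that it does, so a partition that passes the test is still scanned in full.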

We collected a sample of queries from the Hive warehouse of a 
video analytics company, and out of the 3833 queries we obtained, 
at least 3277 of them contain predicates that Shark can use for map 
pruning. Section 6 provides more details on this workload. 



def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...) }

val trainedVector = logRegress(features.cache())



Listing 1: Logistic Regression Example 



4 Machine Learning Support 

A key design goal of Shark is to provide a single system capable 
of efficient SQL query processing and sophisticated machine learn- 
ing. Following the principle of pushing computation to data, Shark 
supports machine learning as a first-class citizen. This is enabled 
by the design decision to choose Spark as the execution engine and 
RDD as the main data structure for operators. In this section, we 
explain Shark's language and execution engine integration for SQL 
and machine learning. 

Other research projects p2l[T4) have demonstrated that it is pos- 
sible to express certain machine learning algorithms in SQL and 
avoid moving data out of the database. The implementation of 
those projects, however, involves a combination of SQL, UDFs, 
and driver programs written in other languages. The systems be- 
come obscure and difficult to maintain; in addition, they may sacri- 
fice performance by performing expensive parallel numerical com- 
putations on traditional database engines that were not designed for 
such workloads. Contrast this with the approach taken by Shark, 
which offers in-database analytics that push computation to data, 
but does so using a runtime that is optimized for such workloads 
and a programming model that is designed to express machine learn- 
ing algorithms. 

4.1 Language Integration 

In addition to executing a SQL query and returning its results, Shark 
also allows queries to return the RDD representing the query plan. 
Callers to Shark can then invoke distributed computation over the 
query result using the returned RDD. 

As an example of this integration, Listing 1 illustrates a data
analysis pipeline that performs logistic regression over a user database.
Logistic regression, a common classification algorithm, searches
for a hyperplane w that best separates two sets of points (e.g.,
spammers and non-spammers). The algorithm applies gradient descent
optimization by starting with a randomized w vector and iteratively
updating it by moving along gradients towards an optimum value.

The program begins by using sql2rdd to issue a SQL query to 
retrieve user information as a TableRDD. It then performs feature 
extraction on the query rows and runs logistic regression over the 
extracted feature matrix. Each iteration of logRegress applies a 
function of w to all data points to produce a set of gradients, which 
are summed to produce a net gradient that is used to update w. 



The highlighted map, mapRows, and reduce functions are au- 
tomatically parallelized by Shark to execute across a cluster, and 
the master program simply collects the output of the reduce func- 
tion to update w. 
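
Read directly off Listing 1, each iteration computes the following net gradient and update. The notation is ours: x_p and y_p denote the feature vector and label of point p, and the assumption that labels take values in {-1, +1} is not stated in the listing. Under that assumption the summand is the gradient of the logistic loss log(1 + exp(-y_p (w . x_p))), and the update uses an implicit step size of 1.

```latex
\nabla(w) \;=\; \sum_{p \,\in\, \text{points}}
    \left( \frac{1}{1 + e^{-y_p \,(w \cdot x_p)}} - 1 \right) y_p \, x_p,
\qquad
w \;\leftarrow\; w - \nabla(w)
```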

Note that this distributed logistic regression implementation in 
Shark looks remarkably similar to a program implemented for a 
single node in the Scala language. The user can conveniently mix 
the best parts of both SQL and MapReduce-style programming. 

Currently, Shark provides native support for Scala and Java, with 
support for Python in development. We have modified the Scala 
shell to enable interactive execution of both SQL and distributed 
machine learning algorithms. Because Shark is built on top of the 
JVM, it is trivial to support other JVM languages, such as Clojure 
or JRuby. 

We have implemented a number of basic machine learning al- 
gorithms, including linear regression, logistic regression, and k- 
means clustering. In most cases, the user only needs to supply a 
mapRows function to perform feature extraction and can invoke 
the provided algorithms. 

The above example demonstrates how machine learning compu- 
tations can be performed on query results. Using RDD as the main 
data structure for query operators also enables the possibility of us- 
ing SQL to query the results of machine learning computations in 
a single execution plan. 

4.2 Execution Engine Integration 

In addition to language integration, another key benefit of using 
RDDs as the data structure for operators is the execution engine in- 
tegration. This common abstraction allows machine learning com- 
putations and SQL queries to share workers and cached data with- 
out the overhead of data movement. 

Because SQL query processing is implemented using RDDs, lin- 
eage is kept for the whole pipeline, which enables end-to-end fault 
tolerance for the entire workflow. If failures occur during the machine
learning stage, partitions on faulty nodes will automatically be
recomputed based on their lineage.
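
Lineage-based recovery can be sketched in a few lines of plain Scala. This is a toy model, not Spark's RDD implementation: the types Lineage, Source, Mapped, and CachedPartition are hypothetical, and a "node failure" is simulated by dropping a cached value. The point is that a lost partition is rebuilt by replaying deterministic steps from its parents rather than by restoring a replica.

```scala
object LineageSketch {
  // A dataset partition is described by how to compute it, not only by its data.
  sealed trait Lineage[A] { def compute(): Vector[A] }

  // Leaf: data that can be re-read from stable storage (here, a function call).
  final case class Source[A](load: () => Vector[A]) extends Lineage[A] {
    def compute(): Vector[A] = load()
  }

  // Derived partition: a deterministic function applied to a parent partition.
  final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
    def compute(): Vector[B] = parent.compute().map(f)
  }

  // A cache slot that can lose its contents (simulating a failed node).
  final class CachedPartition[A](lineage: Lineage[A]) {
    private var cached: Option[Vector[A]] = None
    var recomputations = 0

    // Returns the cached data, recomputing it from lineage if it is missing.
    def get(): Vector[A] = cached.getOrElse {
      recomputations += 1
      val data = lineage.compute()
      cached = Some(data)
      data
    }

    def simulateNodeFailure(): Unit = { cached = None }
  }
}
```

Because the same lineage chain covers both the SQL-derived rows and the features computed from them, a single recovery mechanism serves the whole pipeline in this model.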

5 Implementation 

While implementing Shark, we discovered that a number of engi- 
neering details had significant performance impacts. Overall, to 
improve the query processing speed, one should minimize the tail 
latency of tasks and the CPU cost of processing each row. 

Memory-based Shuffle: Both Spark and Hadoop write map out- 
put files to disk, hoping that they will remain in the OS buffer cache 
when reduce tasks fetch them. In practice, we have found that the 
extra system calls and file system journaling add significant overhead.
In addition, the inability to control when buffer caches are 
flushed leads to variability in the execution time of shuffle tasks. A 
query's response time is determined by the last task to finish, and 
thus the increasing variability leads to long-tail latency, which sig- 
nificantly hurts shuffle performance. We modified the shuffle phase 
to materialize map outputs in memory, with the option to spill them 
to disk. 

Temporary Object Creation: It is easy to write a program that 
creates many temporary objects, which can burden the JVM's garbage 
collector. For a parallel job, a slow GC at one task may slow the 
entire job. Shark operators and RDD transformations are written in 
a way that minimizes temporary object creation. 

Bytecode Compilation of Expression Evaluators: In its current
implementation, Shark sends the expression evaluators generated
by the Hive parser as part of the tasks to be executed on each row.
By profiling Shark, we discovered that for certain queries, when
data is served out of the memory store, the majority of the CPU
cycles are wasted in interpreting these evaluators. We are working on
a compiler to transform these expression evaluators into JVM bytecode,
which can further increase the execution engine's throughput.

Specialized Data Structures: Using specialized data structures 
is another low-hanging optimization that we have yet to exploit. 
For example, Java's hash table is built for generic objects. When 
the hash key is a primitive type, the use of specialized data struc- 
tures can lead to more compact data representations, and thus better 
cache behavior. 
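
As an illustration of what such a specialized structure might look like, the sketch below implements an open-addressing hash map from Long keys to Long values on top of primitive arrays. It is not part of Shark; the class name LongLongMap and its fixed-capacity, no-removal design are simplifying assumptions. Unlike a generic java.util.HashMap, it allocates no boxed key or value objects and no per-entry nodes.

```scala
object PrimitiveMap {
  // Open-addressing hash map from Long keys to Long values, backed by primitive
  // arrays, so no key or value is boxed into a java.lang.Long object.
  // Sketch only: fixed capacity (a power of two), no resizing, no removal.
  final class LongLongMap(capacity: Int) {
    require(capacity > 1 && (capacity & (capacity - 1)) == 0,
      "capacity must be a power of two greater than 1")
    private val keys = new Array[Long](capacity)
    private val values = new Array[Long](capacity)
    private val used = new Array[Boolean](capacity)
    private var count = 0

    // Linear probing: returns the slot holding `key`, or the first free slot.
    // Terminates because at least one slot is always kept free (see `add`).
    private def slot(key: Long): Int = {
      var i = java.lang.Long.hashCode(key) & (capacity - 1)
      while (used(i) && keys(i) != key) i = (i + 1) & (capacity - 1)
      i
    }

    // Adds `delta` to the value stored for `key` (treated as 0 if absent).
    def add(key: Long, delta: Long): Unit = {
      val i = slot(key)
      if (!used(i)) {
        require(count < capacity - 1, "map is full")
        used(i) = true
        keys(i) = key
        count += 1
      }
      values(i) += delta
    }

    def get(key: Long): Long = {
      val i = slot(key)
      if (used(i)) values(i) else 0L
    }

    def size: Int = count
  }
}
```

Keys and values sit contiguously in their arrays, which is the source of the more compact representation and better cache behavior described above.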

6 Experiments 

We evaluated Shark using four datasets: 

1. Pavlo et al. Benchmark: 2.1 TB of data reproducing Pavlo et 
al.'s comparison of MapReduce vs. analytical DBMSs [25]. 

2. TPC-H Dataset: 100 GB and 1 TB datasets generated by the 
DBGEN program [29]. 

3. Real Hive Warehouse: 1.7 TB of sampled Hive warehouse 
data from an early industrial user of Shark. 

4. Machine Learning Dataset: 100 GB synthetic dataset to mea- 
sure the performance of machine learning algorithms. 

Overall, our results show that Shark can perform up to 100×
faster than Hive, even though we have yet to implement some of the
performance optimizations mentioned in the previous section. In
particular, Shark provides comparable performance gains to those
reported for MPP databases in Pavlo et al.'s comparison [25]. In
some cases where data fits in memory, Shark exceeds the performance
reported for MPP databases.

We emphasize that we are not claiming that Shark is fundamentally
faster than MPP databases; there is no reason why MPP
engines could not implement the same processing optimizations
as Shark. Indeed, our implementation has several disadvantages
relative to commercial engines, such as running on the JVM.
Instead, we aim to show that it is possible to achieve comparable
performance while retaining a MapReduce-like engine, and the
fine-grained fault recovery features that such engines provide. In
addition, Shark can leverage this engine to perform high-speed machine
learning functions on the same data, which we believe will be
essential in future analytics workloads.

6.1 Methodology and Cluster Setup 

Unless otherwise specified, experiments were conducted on Amazon
EC2 using 100 m2.4xlarge nodes. Each node had 8 virtual
cores, 68 GB of memory, and 1.6 TB of local storage.

The cluster was running 64-bit Linux 3.2.28, Apache Hadoop 
0.20.205, and Apache Hive 0.9. For Hadoop MapReduce, the num- 
ber of map tasks and the number of reduce tasks per node were set 
to 8, matching the number of cores. For Hive, we enabled JVM 
reuse between tasks and avoided merging small output files, which 
would take an extra step after each query to perform the merge. 

We executed each query six times, discarded the first run, and 
report the average of the remaining five runs. We discard the first 
run in order to allow the JVM's just-in-time compiler to optimize 
common code paths. We believe that this more closely mirrors real- 
world deployments where the JVM will be reused by many queries. 
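
The measurement procedure can be sketched as a small harness. The names Timing, averageAfterWarmup, and mean are hypothetical, and the sketch measures wall-clock time of an arbitrary function rather than a real query; it only shows the "run six times, discard the first, average the remaining five" rule.

```scala
object Timing {
  // Runs `query` `runs` times, drops the first `warmup` measurements, and
  // returns the mean of the rest in milliseconds.
  def averageAfterWarmup(runs: Int, warmup: Int)(query: () => Unit): Double = {
    require(warmup >= 0 && runs > warmup, "need at least one measured run")
    val millis = (1 to runs).map { _ =>
      val start = System.nanoTime()
      query()
      (System.nanoTime() - start) / 1e6
    }
    mean(millis.drop(warmup))
  }

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size
}
```

Under these assumptions, the procedure described above corresponds to a call such as Timing.averageAfterWarmup(runs = 6, warmup = 1)(() => runQuery()), where runQuery stands in for whatever executes the query.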

6.2 Pavlo et al. Benchmarks 

Pavlo et al. compared Hadoop versus MPP databases and showed 
that Hadoop excelled at data ingress, but performed unfavorably in 
query execution [25]. We reused the dataset and queries from their
benchmarks to compare Shark against Hive.

Figure 5: Selection and aggregation query runtimes (seconds)
from Pavlo et al. benchmark

The benchmark used two tables: a 1 GB/node rankings table, 
and a 20 GB/node uservisits table. For our 100-node cluster, we 
recreated a 100 GB rankings table containing 1.8 billion rows and 
a 2 TB uservisits table containing 15.5 billion rows. We ran the 
four queries in their experiments comparing Shark with Hive and
report the results in Figures 5 and 6. In this subsection, we
hand-tuned Hive's number of reduce tasks to produce optimal results for
Hive. Despite this tuning, Shark outperformed Hive in all cases by
a wide margin.

6.2.1 Selection Query 

The first query was a simple selection on the rankings table: 

SELECT pageURL, pageRank
FROM rankings WHERE pageRank > X;

In [25], Vertica outperformed Hadoop by a factor of 10 because
a clustered index was created for Vertica. Even without a clustered
index, Shark was able to execute this query 80× faster than Hive
for in-memory data, and 5× on data read from HDFS.

6.2.2 Aggregation Queries 

The Pavlo et al. benchmark ran two aggregation queries: 

SELECT sourceIP, SUM(adRevenue)
FROM uservisits GROUP BY sourceIP;

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);

In our dataset, the first query had two million groups and the sec- 
ond had approximately one thousand groups. Shark and Hive both 
applied task-local aggregations and shuffled the data to parallelize 
the final merge aggregation. Again, Shark outperformed Hive by a 
wide margin. The benchmarked MPP databases perform local ag- 
gregations on each node, and then send all aggregates to a single 
query coordinator for the final merging; this performed very well 
when the number of groups was small, but performed worse with a
large number of groups. The MPP databases' chosen plan is similar 
to choosing a single reduce task for Shark and Hive. 
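
The two-phase plan can be sketched with plain Scala collections standing in for partitions and reducers. The function names and the hash-based routing below are assumptions for illustration, not Shark's or Hive's code. Setting numReducers to 1 mimics the single-coordinator plan attributed to the MPP databases above.

```scala
object TwoPhaseAggregation {
  // Phase 1 (map side): aggregate within one input partition.
  def localSum(rows: Seq[(String, Double)]): Map[String, Double] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Shuffle: route each partial aggregate to a reducer chosen by key hash.
  def shuffle(partials: Seq[Map[String, Double]],
              numReducers: Int): Vector[Seq[(String, Double)]] = {
    require(numReducers >= 1, "need at least one reducer")
    val all = partials.flatMap(_.toSeq)
    val byReducer = all.groupBy { case (k, _) => Math.floorMod(k.hashCode, numReducers) }
    Vector.tabulate(numReducers)(r => byReducer.getOrElse(r, Seq.empty))
  }

  // Phase 2 (reduce side): each reducer merges the partial sums it received.
  // Reducers own disjoint key sets, so their outputs can simply be combined.
  def aggregate(partitions: Seq[Seq[(String, Double)]],
                numReducers: Int): Map[String, Double] =
    shuffle(partitions.map(localSum), numReducers).map(localSum).reduce(_ ++ _)
}
```

With few groups the partial aggregates are tiny and one reducer suffices; with millions of groups the merge itself becomes the bottleneck, which is why spreading it over many reducers matters.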

6.2.3 Join Query 

The final query from Pavlo et al. involved joining the 2 TB
uservisits table with the 100 GB rankings table.

SELECT INTO Temp sourceIP, AVG(pageRank),
       SUM(adRevenue) as totalRevenue
FROM rankings AS R, uservisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('2000-01-15')
                       AND Date('2000-01-22')
GROUP BY UV.sourceIP;

Figure 6: Join query runtime (seconds) from Pavlo benchmark

Again, Shark outperformed Hive in all cases. Figure 6 shows
that for this query, serving data out of memory did not provide
much benefit over disk. This is because the cost of the join step
dominated the query processing. Co-partitioning the two tables,
however, provided significant benefits as it avoided shuffling
2.1 TB of data during the join step.

6.2.4 Data Loading 

Hadoop was shown by [25] to excel at data loading, as its data
loading throughput was five to ten times higher than that of MPP
databases. As explained in Section 2, Shark can be used to query
data in HDFS directly, which means its data ingress rate is at least
as fast as Hadoop's.

After generating the 2 TB uservisits table, we measured the time 
to load it into HDFS and compared that with the time to load it into 
Shark's memory store. We found the rate of data ingress was 5×
higher in Shark's memory store than in HDFS. 

6.3 Micro-Benchmarks 

To understand the factors affecting Shark's performance, we
conducted a sequence of micro-benchmarks. We generated 100 GB
and 1 TB of data using the DBGEN program provided by
TPC-H [29]. We chose this dataset because it contains tables and columns
of varying cardinality and can be used to create a myriad of
micro-benchmarks for testing individual operators.

While performing experiments, we found that Hive and Hadoop 
MapReduce were very sensitive to the number of reducers set for 
a job. Hive's optimizer automatically sets the number of reducers 
based on the estimated data size. However, we found that Hive's 
optimizer frequently made the wrong decision, leading to incredibly
long query execution times. We hand-tuned the number of
reducers for Hive based on characteristics of the queries and through 
trial and error. We report Hive performance numbers for both optimizer- 
determined and hand-tuned numbers of reducers. Shark, on the 
other hand, was much less sensitive to the number of reducers and 
required minimal tuning. 

6.3.1 Aggregation Performance 

We tested the performance of aggregations by running group-by
queries on the TPC-H lineitem table. For the 100 GB dataset, the
lineitem table contained 600 million rows. For the 1 TB dataset,
it contained 6 billion rows. The queries were of the form:

SELECT [GROUP_BY_COLUMN], COUNT(*) FROM lineitem
GROUP BY [GROUP_BY_COLUMN]

We chose to run one query with no group-by column (i.e., a simple
count), and three queries with group-by aggregations:
SHIPMODE (7 groups), RECEIPTDATE (2500 groups), and SHIPMODE
(150 million groups in 100 GB, and 537 million groups in 1 TB).

Figure 8: Join strategies chosen by optimizers (seconds)

For both Shark and Hive, aggregations were first performed on 
each partition, and then the intermediate aggregated results were 
partitioned and sent to reduce tasks to produce the final aggrega- 
tion. As the number of groups becomes larger, more data needs to 
be shuffled across the network. 

Figure 7 compares the performance of Shark and Hive, measuring
Shark's performance on both in-memory data and data loaded
from HDFS. As can be seen in the figure, Shark was 80× faster
than hand-tuned Hive for queries with small numbers of groups,
and 20× faster for queries with large numbers of groups, where the
shuffle phase dominated the total execution cost.

We were somewhat surprised by the performance gain observed 
for on-disk data in Shark. After all, both Shark and Hive had to 
read data from HDFS and deserialize it for query processing. This 
difference, however, can be explained by Shark's very low task 
launching overhead, optimized shuffle operator, and other factors; 
see Section 7 for more details. 

6.3.2 Join Selection at Run-time 

In this experiment, we tested how partial DAG execution can im- 
prove query performance through run-time re-optimization of query 
plans. The query joined the lineitem and supplier tables from the 1 
TB TPC-H dataset, using a UDF to select suppliers of interest based 
on their addresses. In this specific instance, the UDF selected 1000 
out of 10 million suppliers. Figure 8 summarizes these results. 

SELECT * from lineitem l join supplier s
ON l.L_SUPPKEY = s.S_SUPPKEY
WHERE SOME_UDF(s.S_ADDRESS)

Lacking good selectivity estimation on the UDF, a static opti- 
mizer would choose to perform a shuffle join on these two tables 
because the initial sizes of both tables are large. Leveraging partial 
DAG execution, after running the pre-shuffle map stages for both 
tables, Shark's dynamic optimizer realized that the filtered supplier 
table was small. It decided to perform a map-join, replicating the 
filtered supplier table to all nodes and performing the join using 
only map tasks on lineitem. 
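
The run-time decision can be sketched as a choice made from observed sizes rather than static estimates. The threshold value, the type names, and the choose function are hypothetical; the paper does not specify how Shark's optimizer encodes this rule.

```scala
object JoinSelection {
  sealed trait JoinStrategy
  case object ShuffleJoin extends JoinStrategy
  final case class MapJoin(broadcastSide: String) extends JoinStrategy

  // Hypothetical threshold below which a table is cheap to replicate to all nodes.
  val BroadcastThresholdBytes: Long = 32L * 1024 * 1024

  // Chooses a strategy from the sizes observed after the pre-shuffle map stages,
  // rather than from estimates made before the query ran.
  def choose(leftBytes: Long, rightBytes: Long): JoinStrategy =
    if (rightBytes <= BroadcastThresholdBytes && rightBytes <= leftBytes) MapJoin("right")
    else if (leftBytes <= BroadcastThresholdBytes) MapJoin("left")
    else ShuffleJoin
}
```

In the experiment above, the supplier side after the UDF filter would fall under such a threshold, so a rule of this form would pick a map-join even though both base tables are large.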

To further improve the execution, the optimizer can analyze the
logical plan and infer that the probability of the supplier table being
small is much higher than that of lineitem (since supplier is smaller
initially, and there is a filter predicate on supplier). The optimizer
chose to pre-shuffle only the supplier table, and avoided launching
two waves of tasks on lineitem. This combination of static query
analysis and partial DAG execution led to a 3× performance
improvement over a naive, statically chosen plan.

6.3.3 Fault Tolerance 

To measure Shark's performance in the presence of node failures, 
we simulated failures and measured query performance before,
during, and after failure recovery. Figure 9 summarizes five runs of
our failure recovery experiment, which was performed on a
50-node m2.4xlarge EC2 cluster.

We used a group-by query on the 100 GB lineitem table to measure
query performance in the presence of faults. After loading the
lineitem data into Shark's memory store, we killed a worker machine
and re-ran the query. Shark gracefully recovered from this
failure and parallelized the reconstruction of lost partitions on the
other 49 nodes. This recovery had a small performance impact (3
seconds), but it was significantly cheaper than the cost of re-loading
the entire dataset and re-executing the query.

Figure 7: Aggregation queries on lineitem table. X-axis indicates the number of groups for each aggregation query.

Figure 9: Query time with failures (seconds)

After this recovery, subsequent queries operated against the
recovered dataset, albeit with fewer machines. In Figure 9, the
post-recovery performance was marginally better than the pre-failure 
performance; we believe that this was a side-effect of the JVM's 
JIT compiler, as more of the scheduler's code might have become 
compiled by the time the post-recovery queries were run. 

6.4 Real Hive Warehouse Queries 

An early industrial user provided us with a sample of their Hive 
warehouse data and two years of query traces from their Hive sys- 
tem. A leading video analytics company for content providers and 
publishers, the user built most of their analytics stack based on 
Hadoop. The sample we obtained contained 30 days of video ses- 
sion data, occupying 1.7 TB of disk space when decompressed. It 
consists of a single fact table containing 103 columns, with heavy 
use of complex data types such as array and struct. The 
sampled query log contains 3833 analytical queries, sorted in or- 
der of frequency. We filtered out queries that invoked proprietary 
UDFs and picked four frequent queries that are prototypical of 
other queries in the complete trace. These queries compute ag- 
gregate video quality metrics over different audience segments: 

1. Query 1 computes summary statistics in 12 dimensions for 
users of a specific customer on a specific day. 

2. Query 2 counts the number of sessions and distinct
customer/client combinations grouped by country, with filter
predicates on eight columns.

3. Query 3 counts the number of sessions and distinct users for
all but 2 countries.

4. Query 4 computes summary statistics in 7 dimensions grouping
by a column, and showing the top groups sorted in
descending order.

Figure 10: Real Hive warehouse workloads

Figure 10 compares the performance of Shark and Hive on these
queries. The result is very promising as Shark was able to process
these real-life queries in sub-second latency in all but one case,
whereas it took Hive 50 to 100 times longer to execute them.

A closer look into these queries suggests that this data exhibits 
the natural clustering properties mentioned in Section 3.5. The map 
pruning technique, on average, reduced the amount of data scanned 
by a factor of 30. 

6.5 Machine Learning 

A key motivator of using SQL in a MapReduce environment is the
ability to perform sophisticated machine learning on big data. We
implemented two iterative machine learning algorithms, logistic
regression and k-means, to compare the performance of Shark versus
running the same workflow in Hive and Hadoop.

Figure 12: K-means clustering, per-iteration runtime (seconds)

The dataset was synthetically generated and contained 1 billion 
rows and 10 columns, occupying 100 GB of space. Thus, the fea- 
ture matrix contained 1 billion points, each with 10 dimensions. 
These machine learning experiments were performed on a
100-node m1.xlarge EC2 cluster.

Data was initially stored in relational form in Shark's memory 
store and HDFS. The workflow consisted of three steps: (1) select- 
ing the data of interest from the warehouse using SQL, (2) extract- 
ing features, and (3) applying iterative algorithms. In step 3, both 
algorithms were run for 10 iterations. 

Figures 11 and 12 show the time to execute a single iteration
of logistic regression and k-means, respectively. We implemented
two versions of the algorithms for Hadoop, one storing input data
as text in HDFS and the other using a serialized binary format. The
binary representation was more compact and had lower CPU cost
in record deserialization, leading to improved performance. Our
results show that Shark is 100× faster than Hive and Hadoop for
logistic regression and 30× faster for k-means. K-means experienced
less speedup because it was computationally more expensive than
logistic regression, thus making the workflow more CPU-bound.

In the case of Shark, if data initially resided in its memory store,
steps 1 and 2 were executed in roughly the same time it took to run
one iteration of the machine learning algorithm. If data was not
loaded into the memory store, the first iteration took 40 seconds for
both algorithms. Subsequent iterations, however, reported numbers
consistent with Figures 11 and 12. In the case of Hive and Hadoop,
every iteration took the reported time because data was loaded from
HDFS for every iteration.

7 Discussion 

Shark shows that it is possible to run fast relational queries in a 
fault-tolerant manner using the fine-grained deterministic task model 
introduced by MapReduce. This design offers an effective way to 
scale query processing to ever-larger workloads, and to combine 
it with rich analytics. In this section, we consider two questions: 
first, why were previous MapReduce-based systems, such as Hive, 
slow, and what gave Shark its advantages? Second, are there other 
benefits to the fine-grained task model? We argue that fine-grained 
tasks also help with multitenancy and elasticity, as has been demon- 
strated in MapReduce systems. 

7.1 Why are Previous MapReduce-Based Systems Slow? 

Conventional wisdom is that MapReduce is slower than MPP databases
for several reasons: expensive data materialization for fault
tolerance, inferior data layout (e.g., lack of indices), and costlier
execution strategies [25, 26]. Our exploration of Hive confirms these
reasons, but also shows that a combination of conceptually simple
"engineering" changes to the engine (e.g., in-memory storage) and
more involved architectural changes (e.g., partial DAG execution)
can alleviate them. We also find that a somewhat surprising variable
not considered in detail in MapReduce systems, the task scheduling
overhead, actually has a dramatic effect on performance, and
greatly improves load balancing if minimized.

Intermediate Outputs: MapReduce-based query engines, such as 
Hive, materialize intermediate data to disk in two situations. First, 
within a MapReduce job, the map tasks save their output in case a 
reduce task fails [13]. Second, many queries need to be compiled 
into multiple MapReduce steps, and engines rely on replicated file 
systems, such as HDFS, to store the output of each step. 

For the first case, we note that map outputs were stored on disk 
primarily as a convenience to ensure there is sufficient space to hold 
them in large batch jobs. Map outputs are not replicated across 
nodes, so they will still be lost if the mapper node fails [13]. Thus, 
if the outputs fit in memory, it makes sense to store them in mem- 
ory initially, and only spill them to disk if they are large. Shark's 
shuffle implementation does this by default, and sees far faster shuf- 
fle performance (and no seeks) when the outputs fit in RAM. This 
is often the case in aggregations and filtering queries that return a 
much smaller output than their input.^ Another hardware trend that 
may improve performance, even for large shuffles, is SSDs, which 
would allow fast random access to a larger space than memory. 

For the second case, engines that extend the MapReduce execu- 
tion model to general task DAGs can run multi-stage jobs without 
materializing any outputs to HDFS. Many such engines have been 
proposed, including Dryad, Tenzing and Spark [17, 9, 33]. 

Data Format and Layout: While the naive pure schema-on-read 
approach to MapReduce incurs considerable processing costs, many 
systems use more efficient storage formats within the MapReduce 
model to speed up queries. Hive itself supports "table partitions" (a 
basic index-like system where it knows that certain key ranges are 
contained in certain files, so it can avoid scanning a whole table), as 
well as column-oriented representation of on-disk data [28]. We go 
further in Shark by using fast in-memory columnar representations 
within Spark. Shark does this without modifying the Spark runtime 
by simply representing a block of tuples as a single Spark record 
(one Java object from Spark's perspective), and choosing its own 
representation for the tuples within this object. 
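
A minimal sketch of this "block of tuples as one record" idea follows. The Row type with two hypothetical columns (id and revenue), the RowBlock class, and the helper functions are assumptions for illustration and do not reflect Shark's actual storage classes. From the engine's perspective each RowBlock is a single object; inside it, each column is a primitive array.

```scala
object ColumnarBlock {
  // Row-oriented input: one small object per tuple (hypothetical two-column schema).
  final case class Row(id: Int, revenue: Double)

  // One block is one record from the engine's point of view, but internally
  // each column lives in its own primitive array.
  final class RowBlock(val id: Array[Int], val revenue: Array[Double]) {
    require(id.length == revenue.length, "columns must have equal length")
    def numRows: Int = id.length
    def row(i: Int): Row = Row(id(i), revenue(i))
  }

  // Packs rows into columnar blocks of at most rowsPerBlock rows each.
  def toBlocks(rows: Seq[Row], rowsPerBlock: Int): Vector[RowBlock] =
    rows.grouped(rowsPerBlock).map { chunk =>
      new RowBlock(chunk.map(_.id).toArray, chunk.map(_.revenue).toArray)
    }.toVector

  // A scan that needs a single column reads only that column's arrays.
  def sumRevenue(blocks: Seq[RowBlock]): Double = blocks.map(_.revenue.sum).sum
}
```

The engine stores and ships a handful of large objects instead of one object per tuple, and a query touching one column never reads the others.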

Another feature of Spark that helps Shark, but was not present in 
previous MapReduce runtimes, is control over the data partitioning 
across nodes (Section 3.4). This lets us co-partition tables. 

Finally, one capability of RDDs that we do not yet exploit is ran- 
dom reads. While RDDs only support coarse-grained operations 
for their writes, read operations on them can be fine-grained,
accessing just one record [33]. This would allow RDDs to be used as
indices. Tenzing can use such remote-lookup reads for joins [9].

Execution Strategies: Hive spends considerable time on sorting 
the data before each shuffle and writing the outputs of each MapRe- 
duce stage to HDFS, both limitations of the rigid, one-pass MapRe- 
duce model in Hadoop. More general runtime engines, such as 
Spark, alleviate some of these problems. For instance, Spark supports
hash-based distributed aggregation and general task DAGs. 

^ Systems like Hadoop also benefit from the OS buffer cache in
serving map outputs, but we found that the extra system calls and
file system journaling from writing map outputs to files still add
overhead (Section 5).



To truly optimize the execution of relational queries, however, 
we found it necessary to select execution plans based on data statis- 
tics. This becomes difficult in the presence of UDFs and complex 
analytics functions, which we seek to support as first-class citizens 
in Shark. To address this problem, we proposed partial DAG execu- 
tion (PDE), which allows our modified version of Spark to change 
the downstream portion of an execution graph once each stage com- 
pletes based on data statistics. PDE goes beyond the runtime graph 
rewriting features in previous systems, such as DryadLINQ [19], 
by collecting fine-grained statistics about ranges of keys and by 
allowing switches to a completely different join strategy, such as 
broadcast join, instead of just selecting the number of reduce tasks. 

Task Scheduling Cost: Perhaps the most surprising engine prop- 
erty that affected Shark, however, was a purely "engineering" con- 
cern: the overhead of launching tasks. Traditional MapReduce sys- 
tems, such as Hadoop, were designed for multi-hour batch jobs 
consisting of tasks that were several minutes long. They launched 
each task in a separate OS process, and in some cases had a high 
latency to even submit a task. For instance, Hadoop uses periodic 
"heartbeats" from each worker every 3 seconds to assign tasks, and 
sees overall task startup delays of 5-10 seconds. This was sufficient 
for batch workloads, but clearly falls short for ad-hoc queries. 

Spark avoids this problem by using a fast event-driven RPC li- 
brary to launch tasks and by reusing its worker processes. It can 
launch thousands of tasks per second with only about 5 ms of over- 
head per task, making task lengths of 50-100 ms and MapReduce 
jobs of 500 ms viable. What surprised us is how much this affected 
query performance, even in large (multi-minute) queries. 

Sub-second tasks allow the engine to balance work across nodes 
extremely well, even when some nodes incur unpredictable delays 
(e.g., network delays or JVM garbage collection). They also help 
dramatically with skew. Consider, for example, a system that needs 
to run a hash aggregation on 100 cores. If the system launches 100 
reduce tasks, the key range for each task needs to be carefully cho- 
sen, as any imbalance will slow down the entire job. If it could split 
the work among 1000 tasks, then the slowest task can be as much 
as 10× slower than the average without affecting the job response 
time much! After implementing skew-aware partition selection in 
PDE, we were somewhat disappointed that it did not help compared 
to just having a higher number of reduce tasks in most workloads, 
because Spark could comfortably support thousands of such tasks. 
However, this property makes the engine highly robust to unex- 
pected skew. 
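
The skew argument above can be checked with a small simulation. The sketch below is illustrative only and is not taken from Shark or Spark: it assumes a greedy scheduler that hands each task to the first free core, ignores task launch overhead, and uses a hypothetical helper name (makespan) and arbitrary work units.

```python
import heapq

def makespan(task_durations, num_cores):
    """Greedy list scheduling: each task runs on the core that frees up first."""
    cores = [0.0] * num_cores          # time at which each core becomes free
    heapq.heapify(cores)
    for d in task_durations:
        free_at = heapq.heappop(cores)
        heapq.heappush(cores, free_at + d)
    return max(cores)

TOTAL_WORK = 1000.0   # arbitrary units of work (assumption for illustration)
CORES = 100

# 100 coarse tasks, one of which is 10x slower than the average.
coarse = [TOTAL_WORK / 100] * 100
coarse[0] *= 10

# 1000 fine tasks, one of which is 10x slower than the average.
fine = [TOTAL_WORK / 1000] * 1000
fine[0] *= 10

coarse_time = makespan(coarse, CORES)   # 100.0, versus 10.0 with no skew
fine_time = makespan(fine, CORES)       # 11.0, versus 10.0 with no skew
print(coarse_time, fine_time)
```

Under these assumptions the single slow task stretches the 100-task job tenfold, while the 1000-task job finishes only about 10% later than the perfectly balanced case.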

In this way, Spark stands in contrast to Hadoop/Hive, where using
the wrong number of tasks was sometimes 10x slower than an
optimal plan, and there has been considerable work to automatically
choose the number of reduce tasks [21]. Figure 13
shows how job execution time varies with the number of reduce tasks
launched by Hadoop and Spark. Since a Spark job can launch thou- 
sands of reduce tasks without incurring much overhead, partition 
data skew can be mitigated by always launching many tasks. 
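
A back-of-the-envelope model shows why per-task overhead decides how many tasks an engine can afford. This is a toy sketch, not a measurement: it assumes equal-sized tasks that run in waves and a fixed launch cost per task, and it plugs in the figures quoted above (roughly 5 s for Hadoop, the low end of the 5-10 s range, and 5 ms for Spark). The function name job_time and the 1000 core-second workload are hypothetical.

```python
import math

def job_time(total_work_s, num_tasks, num_cores, launch_overhead_s):
    """Toy model: tasks run in waves, and each task pays a fixed launch cost."""
    waves = math.ceil(num_tasks / num_cores)
    per_task = total_work_s / num_tasks + launch_overhead_s
    return waves * per_task

WORK = 1000.0   # core-seconds of useful work (assumption for illustration)
CORES = 100

for n in (100, 1000, 5000):
    slow_launch = job_time(WORK, n, CORES, 5.0)     # multi-second task startup
    fast_launch = job_time(WORK, n, CORES, 0.005)   # about 5 ms task startup
    print(f"{n:5d} tasks: {slow_launch:7.2f} s vs {fast_launch:6.2f} s")
```

In this model the slow-launch engine gets steadily slower as tasks are added, while the fast-launch engine stays close to the ideal 10 s even at 5000 tasks. The model ignores skew, which is what penalizes using too few tasks.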

Figure 13: Task launching overhead



More fundamentally, there are few reasons why sub-second tasks
should not be feasible even at higher scales than we have explored,
such as tens of thousands of nodes. Systems like Dremel [24] routinely
run sub-second, multi-thousand-node jobs. Indeed, even if
a single master cannot keep up with the scheduling decisions, the 
scheduling could be delegated across "lieutenant" masters for sub- 
sets of the cluster. Fine-grained tasks also offer many advantages 
over coarser-grained execution graphs beyond load balancing, such 
as faster recovery (by spreading out lost tasks across more nodes) 
and query elasticity; we discuss some of these next. 

7.2 Other Benefits of the Fine-Grained Task Model 

While this paper has focused primarily on the fault tolerance ben- 
efits of fine-grained deterministic tasks, the model also provides 
other attractive properties. We wish to point out two benefits that 
have been explored in MapReduce-based systems. 

Elasticity: In traditional MPP databases, a distributed query plan 
is selected once, and the system needs to run at that level of par- 
allelism for the whole duration of the query. In a fine-grained task 
system, however, nodes can appear or go away during a query, and 
pending work will automatically be spread onto them. This en- 
ables the database engine to naturally be elastic. If an administrator 
wishes to remove nodes from the engine (e.g., in a virtualized cor- 
porate data center), the engine can simply treat those as failed, or 
(better yet) proactively replicate their data to other nodes if given 
a few minutes' warning. Similarly, a database engine running on a 
cloud could scale up by requesting new VMs if a query is expensive.
Amazon's Elastic MapReduce [2] already supports resizing
clusters at runtime. 
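
To make the elasticity point concrete, the following sketch simulates a query whose pending tasks are pulled by whichever core is free, with extra cores joining partway through. It is a simplified illustration under stated assumptions (equal-length tasks, no launch overhead, no node removal); the helper name finish_time and the workload numbers are hypothetical.

```python
import heapq

def finish_time(task_durations, core_start_times):
    """Greedy pull scheduling: each pending task runs on the core that is free first.

    core_start_times[i] is the time at which core i joins the cluster
    (0.0 means it is present from the start of the query).
    """
    cores = list(core_start_times)
    heapq.heapify(cores)
    finish = 0.0
    for d in task_durations:
        free_at = heapq.heappop(cores)
        heapq.heappush(cores, free_at + d)
        finish = max(finish, free_at + d)
    return finish

tasks = [1.0] * 1000                                     # 1000 one-second tasks
static = finish_time(tasks, [0.0] * 50)                  # 50 cores throughout
elastic = finish_time(tasks, [0.0] * 50 + [10.0] * 50)   # 50 more cores join at t = 10 s
print(static, elastic)
```

No replanning is needed in this model: the new cores simply start pulling pending tasks, and the query finishes at 15 s instead of 20 s.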

Multitenancy: The elasticity described above also enables dynamic
resource sharing between users. In a traditional MPP database,
if an important query arrives while another large query is using most
of the cluster, there are few options beyond canceling the earlier
query. In systems based on fine-grained tasks, one can simply wait 
a few seconds for the current tasks from the first query to finish, 
and start giving the nodes tasks from the second query. For in- 
stance, Facebook and Microsoft have developed fair schedulers for 
Hadoop and Dryad that allow large historical queries, compute- 
intensive machine learning jobs, and short ad-hoc queries to safely 
coexist [32, 18].
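
The slot hand-off described above can be sketched as a simple fair-share rule: whenever a task finishes, the freed slot goes to the query with the fewest running tasks per unit of weight. This is a minimal illustration and not the actual Hadoop or Dryad scheduler; the function pick_next_query, the query names, and the weights are all hypothetical.

```python
def pick_next_query(running, weights, pending):
    """Return the query with pending tasks that is furthest below its fair share."""
    candidates = [q for q in pending if pending[q] > 0]
    if not candidates:
        return None
    # Fewest running tasks per unit weight wins; the name breaks ties deterministically.
    return min(candidates, key=lambda q: (running.get(q, 0) / weights[q], q))

# Four slots, all busy with a large batch query when a short ad-hoc query arrives.
running = {"batch": 4, "adhoc": 0}
pending = {"batch": 100, "adhoc": 2}
weights = {"batch": 1.0, "adhoc": 1.0}

assignments = []
for _ in range(3):                 # three batch tasks finish, freeing three slots
    running["batch"] -= 1
    q = pick_next_query(running, weights, pending)
    running[q] += 1
    pending[q] -= 1
    assignments.append(q)

print(assignments)   # ['adhoc', 'adhoc', 'batch']
```

With sub-second tasks, the ad-hoc query in this sketch obtains its slots as soon as the first two batch tasks complete, without canceling the batch query.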

8 Related Work 

To the best of our knowledge, Shark is the only low-latency system
that can efficiently combine SQL and machine learning workloads, 
while supporting fine-grained fault recovery. 

We categorize large-scale data analytics systems into three classes. 
First, systems like Hive [28], Tenzing [9], SCOPE [8], and Cheetah [10]
compile declarative queries into MapReduce-style jobs.
Even though some of them introduce modifications to the execu- 
tion engine they are built on, it is hard for these systems to achieve 
interactive query response times for reasons discussed in Section 7.

Second, several projects aim to provide low-latency engines us- 
ing architectures resembling shared-nothing parallel databases. Such 
projects include PowerDrill [16] and Impala [1]. These systems
do not support fine-grained fault tolerance. In case of mid-query
faults, the entire query needs to be re-executed. Google's Dremel [24]
does rerun lost tasks, but it only supports an aggregation tree topol- 
ogy for query execution, and not the more complex shuffle DAGs 
required for large joins or distributed machine learning. 

A third class of systems takes a hybrid approach by combining a
MapReduce-like engine with relational databases. HadoopDB [3]
connects multiple single-node database systems using Hadoop as
the communication layer. Queries can be parallelized using Hadoop
MapReduce, but within each MapReduce task, data processing is
pushed into the relational database system. Osprey [31] is a middleware
layer that adds fault-tolerance properties to parallel databases.
It does so by breaking a SQL query into multiple small queries and 
sending them to parallel databases for execution. Shark presents 
a much simpler single-system architecture that supports all of the
properties of this third class of systems, as well as statistical learn- 
ing capabilities that HadoopDB and Osprey lack. 

The partial DAG execution (PDE) technique introduced by Shark 
resembles adaptive query optimization techniques proposed in
[6, 30, 20]. It is, however, unclear how these single-node techniques
would work in a distributed setting and scale out to hundreds of 
nodes. In fact, PDE actually complements some of these tech- 
niques, as Shark can use PDE to optimize how data gets shuf- 
fled across nodes, and use the traditional single-node techniques 
within a local task. DryadLINQ [19] optimizes its number of reduce
tasks at run-time based on map output sizes, but does not
collect richer statistics, such as histograms, or make broader execution
plan changes, such as changing join algorithms, as PDE
can. RoPE [4] proposes using historical query information to optimize
query plans, but relies on repeatedly executed queries. PDE
works on queries that are executing for the first time. 
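
The kind of run-time decision PDE makes can be sketched as a function that inspects map output sizes and picks a join strategy and a reducer count. This is an illustrative sketch only, not Shark's implementation: the names JoinPlan and replan_join, the 32 MiB broadcast threshold, and the 64 MiB per-reducer target are all assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class JoinPlan:
    strategy: str       # "broadcast" or "shuffle"
    num_reducers: int   # 0 for a broadcast (map-side) join

def replan_join(left_bytes, right_bytes,
                broadcast_threshold=32 * 1024**2,        # hypothetical cutoff
                target_bytes_per_reducer=64 * 1024**2):  # hypothetical target
    """Pick a join strategy from map output sizes observed at run-time."""
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        # One side is small enough to ship to every node, so skip the shuffle.
        return JoinPlan("broadcast", 0)
    total = left_bytes + right_bytes
    reducers = max(1, -(-total // target_bytes_per_reducer))  # ceiling division
    return JoinPlan("shuffle", reducers)

print(replan_join(10 * 1024**2, 50 * 1024**3))   # small side: broadcast join
print(replan_join(1 * 1024**3, 3 * 1024**3))     # both large: shuffle join, 64 reducers
```

Because the sizes are measured after the map stage runs, a rule of this shape applies to queries executing for the first time, with no historical statistics required.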

Finally, Shark builds on the distributed approaches for machine 
learning developed in systems like GraphLab [22], HaLoop [7], and
Spark [33]. However, Shark is unique in offering these capabilities
in a SQL engine, allowing users to select data of interest using
SQL and immediately run learning algorithms on it without time- 
consuming export to another system. Compared to Spark, Shark 
also provides far more efficient in-memory representation of rela- 
tional data, and mid-query optimization using PDE. 

9 Conclusion 

We have presented Shark, a new data warehouse system that com- 
bines fast relational queries and complex analytics in a single, fault- 
tolerant runtime. Shark generalizes a MapReduce-like runtime to 
run SQL effectively, using both traditional database techniques, 
such as column-oriented storage, and a novel partial DAG exe- 
cution (PDE) technique that lets it reoptimize queries at run-time 
based on fine-grained data statistics. This design enables Shark
to generally match the speedups reported for MPP databases over 
MapReduce, while simultaneously providing machine learning func- 
tions in the same engine and fine-grained, mid-query fault tolerance 
across both SQL and machine learning. Overall, the system is up 
to 100x faster than Hive for SQL, and 100x faster than Hadoop
for machine learning. 

We have open sourced Shark at shark.cs.berkeley.edu
and have also worked with two Internet companies as early users. 
They report speedups of 40-100x on real queries, consistent with
our results. 